1.  Introduction

Investigators engaged in Army research and development are constantly confronted with
the need to make comparisons; e.g., comparison of penetration measurements taken on
different types of armor with penetration measurements for rolled homogeneous armor;
comparison of data collected on penetrators with differing metallurgical properties with data
for an existing penetrator; comparison of wear data for several styles of military boot with
that for the current issue. All these situations share a common data structure: s sets of
observations $Y_{i1}, \ldots, Y_{in_i}$, $i = 1, \ldots, s$, taken on processes of like form, need to be
compared against a reference set $Y_{01}, \ldots, Y_{0n_0}$. Because this data structure is so prevalent
in the social and health sciences, the reference set $Y_{01}, \ldots, Y_{0n_0}$ has come to be called a
baseline, or control, and the experimental data sets $Y_{i1}, \ldots, Y_{in_i}$, $i = 1, \ldots, s$, are called
treatments. We will adopt this well-established terminology.

Suppose then that s treatments are to be compared to a control. The purpose of the
comparison is to determine which treatment(s), if any, offer an improvement over the control.
Often, improvement is reflected by the tendency of a treatment to take on larger values than
the control. (The sense of the inequality can always be reversed, since
min f(x) = -max(-f(x)).) Standard techniques for analysis of these data, broadly described as
multiple comparison procedures or simultaneous statistical inference, are developed under
several basic premises, some of which may be difficult, if not impossible, to meet. Scrutiny of
the underlying assumptions will be maintained in the sections to follow.

In  Section  2  we  consider  a  normal  theory  approach  to  the  problem;  in  Section  3 
nonparametric,  or  distribution-free,  approaches  are  described;  in  Section  4  applications  of 
randomization  procedures  to  ballistic  data  are  given;  and  in  Section  5  we  state  our 
conclusions. 

2.  Normal  theory  approach 

A normal theory approach to analysis of the data described in Section 1 requires three
basic assumptions. The first is that the s treatment data sets and the control are all samples
from normal populations. The second is that the samples are random samples, loosely
meaning that they are highly representative of the populations from which they came. The
third is a homogeneity of variance assumption: the population variances, although
unknown, are identical.

In their aggregate, these three assumptions remove much of the difficulty from the
problem. What remains is s + 1 normal populations, identical except possibly in their
location on the real line, and a question of that location; i.e., what are their means? The
random sample assumption is then invoked to facilitate estimation of the means.

When  these  assumptions  can  be  justified,  the  normal  theory  approach  is  unsurpassed. 
When  one  or  more  of  the  assumptions  cannot  be  justified  the  normal  theory  approach  may  be 
unwarranted,  although  arguments  of  robustness  may  be  made  by  the  statistician  who  has  no 
alternative  to  offer. 

The normality assumption still admits some flexibility, supported in part by the powerful
Central Limit Theorem [2], and may sometimes serve as an adequate model for
measurements on either discrete or continuous random variables. The random sample
assumption is usually more unrealistic, since a random sample is difficult to obtain in the best
circumstances, and ballistic data is rarely collected in the best circumstances. Homogeneity
of variance is a recalcitrant assumption, difficult to supplant in this data framework if relaxed,
and so it will be retained in what follows.

In addition to questionable compliance with the normal theory assumptions, ballistic data
often presents further problems for the data analyst, a principal one being small sample size.
Samples of size 3, 4, 5, ..., cannot be modeled with confidence by any distribution, unless the
experimenter possesses concomitant information beyond the data itself.¹ For these, and other
less compelling reasons, we will not pursue the well documented normal theory approach any
further here.

3.  Nonparametric  approach 

In  a  nonparametric  approach  to  the  problem,  the  normality  assumption  (and,  indeed,  any 
specific  distribution  assumption)  is  removed;  the  requirement  that  both  treatment  and 
control  groups  be  random  samples  will  be  retained  temporarily. 

3.1  Rank  tests 

A direct nonparametric approach to the situation described in Section 1 is to repeatedly
apply Wilcoxon rank-sum tests [8] for pairwise comparison of the s treatments with the
control.² For each treatment i, the $n_i$ observations on the ith treatment and the $n_0$ control
observations are combined and ranked by arranging their values from smallest to largest and
assigning the integer k to the kth smallest element, $k = 1, \ldots, n_0 + n_i$. If tied ranks occur,
they are averaged. The ith treatment is declared superior to the control if its rank-sum

$$R_i = \sum_{j=1}^{n_i} R(Y_{ij})$$

is sufficiently large; that is,

$$R_i > c_i, \tag{3.1.1}$$

where the notation $R(Y_{ij})$ represents the rank assigned to the treatment observation $Y_{ij}$.

1. For example, it has been historically established that small arms fire on a vertical target generates a
pattern of impact locations which is modeled well with a bivariate normal distribution.

2. The Wilcoxon rank-sum test is also known as the Mann-Whitney test, since an equivalent procedure
appears in the literature under both names.

The procedure for comparison of treatment and control remains incomplete until the
parameters $c_i$ in (3.1.1) are specified. Under a null hypothesis $H_i$ of no treatment effect for
the ith treatment, $c_i$ may be determined to satisfy the relation

$$P_{H_i}(R_i > c_i) = \alpha_i, \tag{3.1.2}$$

where $\alpha_i$ is an acceptable error of rejecting a valid null hypothesis. Since the null
distribution of $R_i$ is generated by only $(n_0 + n_i)!/(n_0!\, n_i!)$ equally likely assignments of
ranks (disregarding tied ranks), only a limited choice of values for $\alpha_i$ is available in (3.1.2).
With this caveat, when $n_1 = \cdots = n_s$, at the discretion of the investigator, $\alpha_i$, and hence
$c_i$, may be chosen independently of i and (3.1.2) becomes

$$P_{H_i}(R_i > c) = \alpha \quad \forall i.$$
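The limited choice of levels available in (3.1.2) can be seen by direct enumeration. The following is a minimal sketch, assuming no tied ranks and using illustrative sample sizes of three control and three treatment observations; the function name `rank_sum_upper_tails` is hypothetical and not part of the original report.

```python
from fractions import Fraction
from itertools import combinations

def rank_sum_upper_tails(n_control, n_treatment):
    """Exact null probabilities P(R >= r) for the treatment rank-sum R.

    Assumes no ties: every choice of n_treatment ranks out of
    1..(n_control + n_treatment) is equally likely under the null hypothesis.
    """
    total = n_control + n_treatment
    sums = [sum(c) for c in combinations(range(1, total + 1), n_treatment)]
    return {r: Fraction(sum(1 for v in sums if v >= r), len(sums))
            for r in sorted(set(sums))}

# Illustrative sizes only: 3 control and 3 treatment observations (20 assignments).
tails = rank_sum_upper_tails(3, 3)
for r, p in tails.items():
    print(r, p)
```

With these sizes the smallest attainable levels are P(R >= 15) = 1/20 and P(R >= 14) = 1/10, so in the form of (3.1.2) a choice of c = 14 gives a level of .05 and c = 13 gives .10; no level between them is available.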

The probabilities $\alpha$ ($\alpha_i$) may be thought of as measuring the frequency of false significance
statements in a large number of comparisons between the ith treatment and control when $H_i$
of no treatment effect is true. This is sometimes referred to as the error rate per comparison.
This suggests that for a fixed value of $\alpha$, a large number of treatments for comparison (a
large value for s) will likely lead to an erroneous rejection of at least one hypothesis $H_i$. It
can be shown (Lehmann [8], p. 228) that this likelihood is maximum when all the hypotheses
$H_1, \ldots, H_s$ are true. To accommodate this problem of multiplicity it is sometimes preferable
to consider the s comparisons as a single entity and establish an experimentwise error rate,
independent of s.
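As a rough illustration of how quickly multiplicity inflates the chance of at least one false rejection, consider the bound below. The first relation (Bonferroni's inequality) holds whatever the dependence among the comparisons. The second is only a reference calculation under an independence assumption that does not hold here, since all comparisons share one control; the values $\alpha = .05$ and $s = 10$ are chosen purely for illustration.

```latex
P_H\Bigl(\,\bigcup_{i=1}^{s} \{R_i > c\}\Bigr) \;\le\; \sum_{i=1}^{s} P_H(R_i > c) \;=\; s\,\alpha ,
\qquad
1 - (1 - \alpha)^{s}\,\Big|_{\alpha = .05,\; s = 10} \;=\; 1 - (.95)^{10} \;\approx\; .40 .
```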

If the s comparisons are considered a single entity, and an accompanying global null
hypothesis H of no treatment effect for any treatment is invoked, then evidence of at least
one treatment effect suffices to invalidate H. This will occur when $\max_i R_i > c$ with
probability $\alpha'$; i.e.,

$$P_H(\max_i R_i > c) = \alpha'. \tag{3.1.3}$$

The determination of c in expression (3.1.3) requires an involved computation,
parameterized by the treatment sample sizes $n_1, \ldots, n_s$, the control group size $n_0$, the
number of treatments considered, s, and the choice of $\alpha'$ (experimentwise error rate). The
parameters may still be manageable; what is unmanageable is the probability structure
relating the rank-sums $R_i$, $i = 1, \ldots, s$, which must be incorporated into the computation. The
rank-sums $R_i$ are determined using a common control, and are thus stochastically dependent
in some fashion which usually cannot be determined. Limited computational results on
(3.1.3), after several simplifying assumptions have been made, are provided by Steel [13] and
Miller [10].

This approach has been followed to the point of intractable computation. As mentioned
at the outset, distribution assumptions were removed, but the random sample assumption was
retained. We now turn our attention to a second nonparametric technique which holds both
theoretical and practical appeal, and which we advocate for a large class of ballistic data
analyses.


3.2  Randomization  tests


In a second nonparametric approach, and the main topic of this paper, both distribution
assumptions and the random sample requirement are removed. The penalty paid is that
inferences are limited to the data actually considered; generalization to a conceptual
population cannot be made, since the samples, which are no longer random, are not
necessarily representative of some larger population.

The  penalty,  however,  is  not  a  particularly  heavy  one.  According  to  Edgington  [3],  "Few 
experiments  in  biology,  education,  medicine,  psychology,  or  any  other  field  use  randomly 
selected  subjects,  and  those  that  do  usually  concern  populations  so  specific  as  to  be  of  little 
interest.  ...  The  population  of  interest  to  the  experimenter  is  likely  to  be  one  that  cannot  be 
sampled  randomly." 

Randomization was apparently first suggested by Fisher [4] and extended to nonrandom
samples by Pitman [11]. Although a conceptually simple and fundamentally sound procedure,
it has not been fully utilized by applied statisticians. The likely reason is that the procedure
is computationally intensive compared to its parametric counterparts; however, those
counterparts (t-test, F-test, etc.) may be valid only to the extent that they approximate the
results given by the randomization procedure. The ideas behind a randomization test are
best conveyed by example.

Suppose  that  we  have  the  following  measurements: 

Table 1. Measurements on a Single Treatment vs. Control

control    treatment
 7.01        7.72
 7.37        7.62
 6.81        7.29

A cursory examination of the data reveals that the treatment group contains most of the
larger values. As a matter of fact, if the control and treatment data were combined and
ranked as in the Wilcoxon test procedure detailed in Section 3.1, ranks 3, 5, and 6 would be
assigned to treatment. Whether this is sufficient to assert the superiority of treatment over
control remains to be established quantitatively.
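The ranking step referred to above can be sketched as follows. This is a minimal illustration of the Section 3.1 rule (rank from smallest to largest, average tied ranks) applied to the Table 1 measurements; the helper name `average_ranks` is hypothetical.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest, averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

control = [7.01, 7.37, 6.81]
treatment = [7.72, 7.62, 7.29]
ranks = average_ranks(control + treatment)
treatment_ranks = sorted(ranks[len(control):])
rank_sum = sum(treatment_ranks)
print(treatment_ranks, rank_sum)
```

For these data the treatment ranks are 3, 5, and 6, giving a rank-sum of 14.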

We will choose as a null hypothesis that the treatment is ineffective and has no impact on
the measurements recorded; in other words, the experimental units would have provided the
same measurements regardless of whether they were assigned to the treatment or the control
group.

Under  the  null  hypothesis  of  no  treatment  effect  or,  equivalently,  that  wherever  an 
experimental  unit  is  assigned,  its  measurement  goes  with  it,  the  labels  "control"  and 
"treatment"  become  completely  arbitrary.  Hence,  any  three  measurements  might  be  called 
control  and  the  remaining  three  measurements  treatment;  i.e.,  assignment  to  treatment  and 
control may be randomized. The number of ways this assignment can be made is the number
of ways three objects can be chosen from six distinguishable objects: $\binom{6}{3} = 20$. Table 2
lists the sum of the treatment values corresponding to each of the twenty assignments. We
have summed the actual measurements, rather than their ranks as was done for the Wilcoxon
test, for technical reasons beyond the scope of this paper, although the interested reader is
referred to Lehmann and Stein [9], and Hoeffding [5], for information on scoring.³

The important point is that, under the null hypothesis, the treatment sum actually
observed, 22.63, is exceeded by only one other assignment. This means that the
probability of observing a value as large as or larger than 22.63 when the treatment is
ineffective, as assumed, is 2/20 = .10. This probability, .10, is alternately called the p-value
or the observed significance level in statistical literature. Whether this probability of
occurrence is sufficiently small to suggest rejection of the null hypothesis must be decided by
the experimenter, according to his or her tolerance for error.
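The enumeration just described is small enough to carry out directly. The sketch below is one possible way to do it for the Table 1 data; the variable names are illustrative, and the sums are rounded to two decimals only to avoid floating-point noise when comparing them.

```python
from itertools import combinations

control = [7.01, 7.37, 6.81]
treatment = [7.72, 7.62, 7.29]
pooled = control + treatment

observed = round(sum(treatment), 2)

# Treatment sum for every way of labeling 3 of the 6 measurements "treatment".
sums = sorted(round(sum(c), 2) for c in combinations(pooled, len(treatment)))

# One-sided p-value: share of assignments with a sum at least as large as observed.
p_value = sum(1 for s in sums if s >= observed) / len(sums)
print(len(sums), observed, p_value)  # 20 22.63 0.1
```

The sorted list `sums` runs from 21.11 to 22.71, and only the assignment with sum 22.71 exceeds the observed 22.63.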

Table  2.  Sum  of  Treatment  Values  Corresponding  to  All  Data  Permutations 


treatment sum    treatment sum
    21.11            21.92
    21.19            22.00
    21.44            22.02
    21.47            22.10
    21.54            22.15
    21.67            22.28
    21.72            22.35
    21.80            22.38
    21.82            22.63
    21.90            22.71

The randomization test compares treatment data with control; specifically, it asks how
unusual the observed treatment values are if there is no difference between treatment and
control. The parametric analog of the randomization test in this instance is the t-test. The
t-test assumes both treatment and control measurements are random samples from normal
populations that differ at most by their location on the real line, and looks for a difference in
location (means). For these data, the t-test provides an observed significance level of .042,
suggesting that the treatment may indeed be effective. Compare this value to the
randomization test's .10. The t-test is inappropriate for these nonrandom data, but is
commonly (mis)used.

3. As a technical fact and an historical aside, nonparametric rank tests, including the Wilcoxon test,
were developed as a compromise between the then-computationally-burdensome randomization
test and the alternative, a fully parameterized model.
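For comparison, the t-test figure quoted above can be reproduced with a short calculation. This sketch assumes a one-sided, pooled-variance two-sample t-test (the report does not spell out the variant) and uses a closed-form expression for the Student t distribution function that is valid only for 4 degrees of freedom; the function names are hypothetical.

```python
import math

def pooled_t(control, treatment):
    """Pooled-variance two-sample t statistic (treatment minus control) and its df."""
    n0, n1 = len(control), len(treatment)
    m0, m1 = sum(control) / n0, sum(treatment) / n1
    ss0 = sum((x - m0) ** 2 for x in control)
    ss1 = sum((x - m1) ** 2 for x in treatment)
    df = n0 + n1 - 2
    pooled_var = (ss0 + ss1) / df
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1)), df

def t_cdf_df4(t):
    """Student t distribution function, closed form for exactly 4 degrees of freedom."""
    u = t * t / (4 + t * t)
    return 0.5 + 0.75 * math.copysign(math.sqrt(u), t) * (1 - u / 3)

control = [7.01, 7.37, 6.81]
treatment = [7.72, 7.62, 7.29]
t_stat, df = pooled_t(control, treatment)
p_value = 1 - t_cdf_df4(t_stat)  # one-sided: treatment larger than control
print(df, round(t_stat, 2), round(p_value, 3))
```

Under these assumptions the statistic is about 2.30 on 4 degrees of freedom, and the one-sided p-value rounds to .042, in agreement with the value quoted in the text.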

4.  Examples 

Example  4.1 

Table  3  contains  measurements  of  spin  rates  of  long  rod  penetrators  taken  by  Rapacki 
[12].  The  natural  frequency  of  the  penetrators  is  about  120  cycles  per  second  (hz).  Spin  rates 
close  to  this  value  amplify  the  initial  manufacturing  imperfections  and  increase  in-flight 
bending.  To  avoid  this,  different  fin  configurations  were  designed  to  reduce  the  spin  rate 
below  120  hz. 


Table  3.  Comparison  of  Two  Fin  Redesigns  with  a  Control 


initial design    redesign 1    redesign 2
    163.6            97.5          78.1
    109.0           122.2          76.7
    218.7           108.2          88.5
    143.2
    169.5

A  plot  of  the  data  is  usually  a  good  beginning: 


[Plot not reproduced: spin rate (hz), vertical axis from 0 to 240, shown separately for the
initial design, redesign 1, and redesign 2.]

Figure  1.  Plot  of  Spin  Rate  Data 

It appears visually that the fin reconfiguration had the intended effect: reducing the spin
rate. The role of statistics here is to see if this observation has quantitative support. Toward
this end, we will randomize, choosing as a null hypothesis that the two redesigns provide no
improvement over the initial design. If this is true, then the eleven observations may be
arbitrarily partitioned into groups of 5, 3, and 3, and Table 3 is simply one of
11!/(5! 3! 3!) = 9240 possible data configurations.
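The count of configurations is a multinomial coefficient and is easy to check. The sketch below uses a hypothetical helper `multinomial`; the second line shows, for comparison, the much smaller count that arises when only one redesign and the control are permuted together, as in the pairwise comparisons used later in this example.

```python
from math import comb, factorial

def multinomial(*group_sizes):
    """Number of ways to split sum(group_sizes) distinct items into labeled groups."""
    count = factorial(sum(group_sizes))
    for size in group_sizes:
        count //= factorial(size)
    return count

print(multinomial(5, 3, 3))  # 9240: all three groups permuted together
print(comb(8, 3))            # 56: one redesign (3 values) against the control (5 values)
```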

In this example we return more closely in format to the situation described in the
Introduction. We also remain mindful of the two sources of error discussed in Section 3.1,
error rate per comparison and experimentwise error rate. Experimentwise error rate will be
controlled through the use of a multiple comparison procedure, a technique adjunct to the
main topic of this report. An advanced reader may wish to consult Winer [15], Keppel [7], or
Bancroft [1] for details and guidance in selecting an appropriate method. We will choose
here a multiple comparison procedure known as Fisher's modified least significant difference
(Winer [15], p. 199), which has the desirable properties of being both nonparametric and
applicable to unequal sample sizes.

Suppose we specify the largest tolerable experimentwise error rate to be $\alpha' = .05$ for
multiple comparison of the two fin redesigns with the control. Adopting the obvious notation
c, d1, d2 for control and redesigns, we are interested in the comparisons c-d1 and c-d2. The
observed significance level is determined for each of the pairwise comparisons following the
randomization procedure just illustrated, where for each comparison only the data
corresponding to the paired entities are permuted.⁴ Each p-value is then multiplied by two
(the number of comparisons) in accordance with Fisher's procedure to obtain an adjusted
p-value. The p-values and adjusted p-values for comparison of c-d1 and c-d2 are given in
Table 4.


Table  4.  Multiple  Comparison  of  Control  and  Two  Treatments 


comparison          c-d1    c-d2
p-value             .036    .018
adjusted p-value    .071    .036


The adjusted p-value, .036, corresponding to comparison of control and redesign 2, falls
below the $\alpha' = .05$ value chosen for experimentwise error rate, and reflects a statistically
significant difference between the items compared. Comparison of control and redesign 1,
with an adjusted p-value of .071, exceeds $\alpha' = .05$, and does not substantiate a claim of
difference. These conclusions, now quantified, are consistent with the display in Figure 1.
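The pairwise calculations behind Table 4 can be sketched as follows. The report does not name the test statistic used in this example; the sketch assumes the sum of the redesign values, as in the illustration of Section 3.2, with small sums counted as extreme because the goal is a lower spin rate. The function name `lower_tail_p` is hypothetical.

```python
from itertools import combinations

initial = [163.6, 109.0, 218.7, 143.2, 169.5]  # control (c)
redesign1 = [97.5, 122.2, 108.2]               # d1
redesign2 = [78.1, 76.7, 88.5]                 # d2

def lower_tail_p(control, treatment):
    """Share of relabelings whose treatment sum is no larger than the observed sum."""
    pooled = control + treatment
    observed = round(sum(treatment), 1)
    sums = [round(sum(c), 1) for c in combinations(pooled, len(treatment))]
    return sum(1 for s in sums if s <= observed) / len(sums)

n_comparisons = 2
for name, data in (("c-d1", redesign1), ("c-d2", redesign2)):
    p = lower_tail_p(initial, data)
    print(name, round(p, 3), round(min(1.0, n_comparisons * p), 3))
```

Under that assumption the loop prints `c-d1 0.036 0.071` and `c-d2 0.018 0.036`, corresponding to 2 and 1 of the 56 relabelings respectively, which matches Table 4.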


4. We are testing here a restricted null hypothesis. It focuses attention on the comparisons of interest
while easing the overall computational burden.

Another fundamental attribute of randomization tests that enhances their value is the
ability to adapt to virtually any test statistic that seems appropriate. Thus far attention has
been restricted to statistics sensitive to increased (decreased) treatment response when
compared to a control; in the following example, we will consider instead relative dispersion.

Example  4.2

Table 5 lists measurements of horizontal displacement of centers of impact of three-round
shot groups from a reference aim point, reported by Webb et al. [15]. Each three-round shot
group was fired from a different 120 mm tank gun tube. The gun tubes were indexed
(calibrated) by a standard method or by an alternative method based on the dynamic
response of the tube during firing. The intent of indexing is to improve the precision of the
delivered rounds. The question of whether or not dynamic indexing offers an improvement
over standard indexing was pursued at the outset using normal theory analysis on the data
shown in Table 5.


Table  5.  Measurements  of  Center  of  Impact  Displacement  (mils) 


standard    dynamic
  0.14        0.96
  0.22        0.28
  0.40        0.20
  0.15        0.74
  0.11        0.43

The measure of precision chosen was the variance of the centers of impact. This measure
combines both round-to-round and tube-to-tube variability. Webb et al. offer justification for
claiming that round-to-round dispersion is approximately the same for both methods of
indexing, leaving any observed differences in variation between the two methods attributable
to variation between gun tubes.

The comparison of precision associated with the two methods of indexing can be made
using any appropriate statistic; we chose

$$s_S^2 / s_D^2, \quad \text{where } s^2 = \sum_i (x_i - \bar{x})^2 / (n - 1), \tag{4.1}$$

the ratio of the sample variance for centers of impact among standard indexed tubes to that
of dynamically indexed tubes. Large values of this ratio indicate smaller variability or
increased precision of the dynamically indexed tubes relative to the standard; a ratio close to
one represents roughly equivalent precision for the two methods. For the data displayed in
Table 5, the ratio (4.1) was determined to be 0.13.

Any reasonable statistical procedure would fail to support the notion of improvement of
dynamic over standard indexing based on this data set, but we will pursue a randomization
approach in order to reinforce an important point made earlier. In compliance with the
paradigm already outlined, and under a null hypothesis $H_0$ of no difference between the
indexing procedures, the center of impact data are permuted between tube groups, providing
10!/(5! 5!) = 252 arrangements for which the statistic (4.1) is evaluated. The observed
significance level was determined to be

$$P_{H_0}(s_S^2 / s_D^2 \ge 0.13) = 0.85,$$

suggesting that an improvement can be claimed for dynamic indexing only if the experimenter
is willing to accept an 85 per cent chance of misclassification when $H_0$ is correct.
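The same enumeration applies when the statistic is the variance ratio (4.1). The following is a minimal sketch for the Table 5 data; the function name `ratio` and the small comparison tolerance are choices made here for illustration, not details taken from the report.

```python
from itertools import combinations
from statistics import variance  # sample variance, divisor n - 1

standard = [0.14, 0.22, 0.40, 0.15, 0.11]
dynamic = [0.96, 0.28, 0.20, 0.74, 0.43]
pooled = standard + dynamic

def ratio(group_s, group_d):
    """Statistic (4.1): sample variance of the first group over that of the second."""
    return variance(group_s) / variance(group_d)

observed = ratio(standard, dynamic)

ratios = []
for idx in combinations(range(len(pooled)), len(standard)):
    chosen = set(idx)
    group_s = [pooled[i] for i in idx]
    group_d = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    ratios.append(ratio(group_s, group_d))

# Small tolerance so that the observed arrangement is counted as at least as large as itself.
p_value = sum(1 for r in ratios if r >= observed - 1e-12) / len(ratios)
print(len(ratios), round(observed, 2), round(p_value, 2))
```

The script evaluates all 252 arrangements and reproduces the observed ratio of 0.13; the resulting `p_value` is the quantity the report gives as 0.85.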

Some  readers  will  recognize  the  ratio  (4.1)  as  Snedecor’s  F-statistic,  whose  quantiles 
under  normal  theory  assumptions  are  extensively  tabled  and  whose  values  may  be  computed 
in  most  statistical  software  packages.  It  is  worth  restating  that  without  random  sampling 
from  normal  populations,  the  F-statistic  is  useful  only  to  the  extent  that  it  approximates  the 
observed  significance  level  produced  by  the  randomization  test. 

Kempthorne [6] points out that using the F-statistic as an approximation produces a
significance level that will sometimes be too small, and sometimes too large. In this example,
reference to a table of the F-statistic provides a p-value of .96. A discrepancy in p-value of
.11 with the randomization procedure is hardly cause for concern; the same decision will be
made. But suppose the null hypothesis had been reversed; i.e., suppose the intent had been to
decisively show that the standard indexing method resulted in greater precision. The p-values
for the F-test and the randomization test now become .04 and .15, respectively. An
experimenter interpreting these results in a decision theory framework would likely draw
different conclusions (standard indexing is better in the first instance; insufficient
evidence exists to claim that it is better in the second) due simply to choice of procedure.
Nominal values of .05 and .10 are commonly chosen for allowable error in a decision-making
context. The experimenter who relies on normal theory and the F-test as an approximation is
once again misled.

5.  Conclusion 

Randomization  procedures  offer  a  viable  approach  to  the  analysis  of  ballistic  data  over  a 
wide  class  of  problems.  Distribution  assumptions  are  unnecessary  and,  of  even  greater 
importance,  random  samples  of  data  are  not  required.  Small  sample  sizes,  while  never 
welcome,  may  be  accommodated  as  well. 

In statistics, as elsewhere, there is no free lunch. The price paid for randomization is
increased computation, since every problem requires a tailored solution, reflected through the
enumerative process required to determine the p-values. However, use of the normal theory
statistics (t-test, F-test, chi-square test, etc.) may only be valid to the extent that they
approximate the p-values obtained from randomization.

In  the  examples  detailed  in  this  paper,  the  p-values  were  attainable  through  exhaustive 
enumeration  and  the  tests  described  may  be  further  delineated  as  exact  randomization  tests. 
Computing power still limits the amount of enumeration possible, however, in which case
approximate randomization tests, not considered here, may be appropriate.

Reconciliation  of  some  theoretical  questions  raised  by  the  application  of  randomization 
procedures  to  ballistic  data  analysis  makes  this  an  intriguing  and  highly  practical  area  of 
research. 